automRm: An R Package for Fully Automatic LC-QQQ-MS Data Preprocessing Powered by Machine Learning

Preprocessing of liquid chromatography-mass spectrometry (LC-MS) raw data facilitates downstream statistical and biological data analyses. In targeted LC-MS data, consistent recognition of chromatographic peaks is a major challenge, particularly for low-abundance signals. Fully automatic preprocessing is faster than manual peak review and does not depend on the individual operator. Here, we present the R package automRm for fully automatic preprocessing of LC-MS data recorded in MRM mode. Using machine learning (ML) for detection of chromatographic peaks and for quality control of reported results enables the automatic recognition of complex patterns in raw data. In addition, this approach makes automRm applicable to a wide range of analytical methods, including hydrophilic interaction liquid chromatography (HILIC), which is known for sample-to-sample variations in peak shape and retention time. We demonstrate the impact of the choice of training data set, of the applied ML algorithm, and of individual peak characteristics on automRm’s ability to correctly report chromatographic peaks. Next, we show that automRm can replicate results obtained by manual peak review on published data. Moreover, when applied to a HILIC-MS data set, automRm outperforms alternative software solutions regarding the variation in peak integration among replicate measurements and the number of correctly reported peaks. The R package is freely available from GitLab (https://gitlab.gwdg.de/joerg.buescher/automrm).


Supplementary Table S
Averaged correlation of quantifier and up to 2 qualifiers within peak borders.
QS_T.o Ratio of peak top to highest point outside the peak in quantifier chromatogram.
QS_T.h Ratio of peak top to the higher peak border.
QS_T.l Ratio of peak top to the lower peak border.
QS_dRT Deviation between measured and expected retention time.
QS_cor12 Correlation of quantifier to unlabeled qualifier.
QS_cor13 Correlation of quantifier to 13C labeled qualifier.
QS_cor23 Correlation of unlabeled qualifier to 13C labeled qualifier.
QS_relRT Ratio of retention time to total length of chromatographic gradient.
QS_T.o2 Ratio of peak top to highest point outside the peak in unlabeled qualifier chromatogram.
QS_T.o3 Ratio of peak top to highest point outside the peak in 13C labeled qualifier chromatogram.
Correlation of peak to Gaussian curve.
QS_1vs2 Ratio of measured and expected ratios of signal intensity of quantifier and unlabeled qualifier.
QS_dRTs Deviation between measured and expected retention time after peak alignment.
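
As an illustration of the ratio and correlation scores defined above, the following Python sketch computes them for a toy chromatogram. It is not automRm's R implementation: the function names, the `eps` guard, the use of Pearson correlation, and the reading of "higher" and "lower" peak border as the border with the higher or lower intensity are all assumptions made for this example.

```python
import numpy as np

def peak_ratio_scores(quant, start, end):
    """Ratios of the peak top to the peak borders and to the signal outside the peak.

    quant: 1-D quantifier intensities; start/end: inclusive indices of the peak borders.
    """
    quant = np.asarray(quant, dtype=float)
    top = quant[start:end + 1].max()
    borders = (quant[start], quant[end])
    outside = np.concatenate([quant[:start], quant[end + 1:]])
    eps = 1e-9  # hypothetical guard against division by zero
    return {
        # assumption: "higher"/"lower" border refers to border intensity
        "T.h": top / max(max(borders), eps),
        "T.l": top / max(min(borders), eps),
        "T.o": top / max(outside.max(), eps) if outside.size else float("inf"),
    }

def trace_correlation(a, b, start, end):
    """Pearson correlation of two traces within the peak borders (assumed metric)."""
    a = np.asarray(a, dtype=float)[start:end + 1]
    b = np.asarray(b, dtype=float)[start:end + 1]
    if a.std() == 0 or b.std() == 0:
        return 0.0  # a flat trace carries no shape information
    return float(np.corrcoef(a, b)[0, 1])

# Toy data: a quantifier trace and a qualifier with the same shape.
quant = np.array([1, 2, 4, 10, 20, 10, 5, 2, 1, 3], dtype=float)
qual = 0.5 * quant + 1.0
scores = peak_ratio_scores(quant, start=2, end=6)
cor12 = trace_correlation(quant, qual, start=2, end=6)
print(scores, cor12)
```

Large ratios and a correlation close to 1 correspond to a clear peak that is consistent between quantifier and qualifier; how automRm weighs these values is decided by its trained ML model, not by this sketch.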

[Figure labels: Gaussian curve; expected RT; RT; initial peak; shifted peak]
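
The correlation of a peak to a Gaussian curve and the deviation between measured and expected retention time, both listed among the quality scores, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not automRm's method: the Gaussian is parameterised from intensity-weighted moments of the baseline-subtracted peak, the measured retention time is taken at the peak apex, and the sign convention (measured minus expected) and all function names are hypothetical.

```python
import numpy as np

def gaussian_correlation(rt, intensity):
    """Pearson correlation of a peak with a Gaussian of matching centre and width."""
    rt = np.asarray(rt, dtype=float)
    y = np.asarray(intensity, dtype=float)
    w = y - y.min()  # assumption: the minimum serves as a crude baseline
    if w.sum() == 0:
        return 0.0
    mu = np.sum(rt * w) / w.sum()                          # weighted centre
    sigma = np.sqrt(np.sum(w * (rt - mu) ** 2) / w.sum())  # weighted width
    if sigma == 0:
        return 0.0
    gauss = np.exp(-0.5 * ((rt - mu) / sigma) ** 2)
    return float(np.corrcoef(y, gauss)[0, 1])

def rt_deviation(rt, intensity, expected_rt):
    """Measured (apex) retention time minus expected retention time."""
    rt = np.asarray(rt, dtype=float)
    return float(rt[np.argmax(intensity)] - expected_rt)

# Toy data: retention time in minutes, one peak at the expected RT and one shifted.
rt = np.linspace(4.0, 6.0, 41)
expected_rt = 5.0
initial_peak = 100.0 * np.exp(-0.5 * ((rt - 5.0) / 0.2) ** 2)
shifted_peak = 100.0 * np.exp(-0.5 * ((rt - 5.5) / 0.2) ** 2)

cor_gauss = gaussian_correlation(rt, initial_peak)
cor_ramp = gaussian_correlation(rt, rt - 4.0)  # a drifting baseline, not a peak
d_rt = rt_deviation(rt, shifted_peak, expected_rt)
print(cor_gauss, cor_ramp, d_rt)
```

A well-shaped peak yields a correlation close to 1, whereas a drifting baseline scores lower. The shifted peak keeps its shape but shows a nonzero retention time deviation.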

Supplementary Table S2: Example chromatograms from publicly available data sets that demonstrate the performance and limitations of fully automatic data preprocessing by automRm. Please note that the performance of automRm depends strongly on the data used to train the machine learning (ML) models and on the user input during training. For these examples, automRm was trained mostly on HILIC data.

Metabolite Sample and Dataset Chromatogram Comments
Acetyl-CoA 00047975TP36_3Pccm in MTBLS2145 1 A single clear peak with little background and good correlation among quantifier and qualifiers results in a very high quality score.
Acetyl-CoA 00047976TP48_1Pccm in MTBLS2145 1 Despite a shift in retention time of about half a minute, the peak is still recognized with high confidence.

Acetyl-CoA 00047966TP00_3Pccm in MTBLS2145 1 Even a very small peak is still recognized and of sufficient quality to be reported.
Acetylerucifoline N-oxide P1_100730_PAs_009 in MTBLS429 1 A single clear peak with little background and good correlation among quantifier and qualifiers, but without a second qualifier, results in a moderately high quality score. Of note, most metabolites in the training data set had two qualifiers.
Acetylerucifoline N-oxide P1_100730_PAs_015 in MTBLS429 1 A peak that is close to background is not reported by automRm because of insufficient quality. However, the authors of the original study did report an intensity value.

Dehydrojacoline P1_100730_PAs_041 in MTBLS429 1 automRm reports a peak with a high quality score; however, the authors of the original study reported zero.
Dehydrojacoline P1_110513_PAs_006 in MTBLS429 1 automRm reports a peak with a moderately high quality score; however, the authors of the original study reported the smaller peak that better matches the expected retention time (dotted vertical line).
Epigallocatechin T032_TO130_MRM in MTBLS897 1 This peak was not reported by automRm because of insufficient quality. It is likely that the deviation from the expected retention time by almost 1 min severely reduced the quality score. However, the authors of the original study did report this peak.